Abstract

In this work, we analyze the generalization ability of distributed online learning algorithms under stationary and non-stationary
environments. We derive bounds for the excess-risk attained by each node in a connected network of learners and study the
performance advantage that diffusion strategies have over individual non-cooperative processing. We conduct extensive simulations
to illustrate the results.
1. Introduction

Stochastic gradient algorithms provide powerful iterative
techniques for the solution of optimization problems. In
many situations of interest, the objective function is in the form
of the expectation of a convex loss function over the distribution
of the input data. Such situations arise in machine learning
applications, where the input data are features to a classifier and
their associated class labels. For example, the goal of a binary
classifier is to predict the label (±1) given a vector of features
that describes an observation (or, equivalently, to separate two 
classes based on their feature vector descriptions). The clas- 
sifier achieves this goal by learning a classification rule based 
on a cost function that penalizes incorrect classification accord- 
ing to some criterion. The cost function is usually referred to 
as the risk [?, p. 20], and it measures the generalization error
that is achieved by the classifier (that is, it measures how well 
a classifier is able to predict the labels associated with feature 
vectors that have not yet been observed). The excess-risk is 
defined as the difference between the risk achieved by the clas- 
sifier given its classification rule and the smallest risk achiev- 
able by the classifier over all possible classification rules. It is 
critical to study the excess-risk performance of a classifier in 
order to understand how the classifier will perform on future 
data compared to the best possible classifier. 

Several works in the literature study excess-risk indirectly 
by deriving regret bounds and then relating these bounds to 
excess-risk [?]. This two-step procedure suffers from two
drawbacks: 1) the procedure is targeted at algorithms that uti- 
lize diminishing step-sizes, which are not useful for non-sta- 
tionary environments, and 2) the second step that relates the 
regret to excess-risk is not tight, and it has been shown that
online learning algorithms that utilize diminishing step-sizes can
achieve better performance than dictated by the indirect
analysis [5]. In this article, we study the excess-risk directly and
for constant step-sizes in order to cope with non-stationary
environments. Among other results, we establish that a constant
step-size distributed algorithm of the diffusion type can achieve
arbitrarily small excess-risk for appropriately chosen step-sizes
in stationary environments.

Distributed stochastic learning seeks to leverage coopera- 
tion between nodes over a network in order to optimize the 
overall network risk without the need for coordination or su- 
pervision from a central entity that has access to the entire data 
(like features and labels) from all nodes. Such distributed sche- 
mes are particularly useful when the data sampled by the nodes 
cannot be shared broadly due to privacy or communication
constraints. In [?], an algorithm is developed that requires a central
node or server to poll the optimization estimates from all the 
nodes at the end of a time horizon. This approach is not fully 
distributed and is not able to track changes in the generating dis- 
tribution without restarting the algorithm. One fully distributed 
learning algorithm appears in [7], where the global cost is chosen
as the aggregate regret over the network of learners. The
scheme of [7] consists of a single consensus-type iteration of
the form (20) further ahead and is similar to the schemes
proposed in [8] for distributed optimization; the analysis in [8] is
limited to the noise-free case. In the estimation literature,
references [9, 10, 11] proposed distributed schemes that rely on
diffusion rather than consensus iterations. Diffusion strategies
allow information to diffuse more readily through the
network, and they enhance stability, convergence, and robustness
in comparison to consensus strategies [12]. Diffusion strategies
consist of two steps: a combination step that averages the
estimates in the local neighborhood of an agent, and an adaptation
step that incorporates new information into the local
estimator of each agent. The net result is that information is diffused
across the network adaptively and in real-time. The diffusion
approach was generalized in [13] for general strongly convex
cost functions and constant step-sizes.

In comparison to the earlier work on diffusion adaptation
[10, 13, 14], we study in this work the excess-risk performance
of general strongly-convex risk functions as opposed to
mean-square error performance. This level of generality allows us to
study the excess-risk performance for regularized logistic
regression in addition to the delta rule (square loss). In
comparison to [13, 15], we study the tracking performance of the
distributed classifiers when the optimizer is time-varying. The
effective tracking of a drifting concept is only possible when
the algorithm utilizes a constant step-size as opposed to
diminishing step-sizes as used in [10, 14]. Even for stationary
environments, we show that the proposed algorithms obtain better
performance than the non-cooperative solution for any
strongly-convex risk function when some mild assumptions hold
regarding the noise process.

One of the main objectives of this work is therefore to study 
the generalization ability of diffusion strategies when the dis- 
tribution from which the data arises is time-varying. When the 
statistics of the input to the classifier change, the classifier must 
adjust its classification rule in order to accurately classify the 
data arising from the new distribution (see Fig. 3 further
ahead). In the context of machine learning, this change in the
best possible classification rule for the non-stationary data is
referred to as concept drift [16, 17, 18]. We desire to answer one
key question: is it possible to obtain convergence results for the 
distributed learning algorithms to show tracking of a changing 
optimal classification rule? We discover that the answer is in 
the affirmative under some assumptions. 

2. Problem Formulation and Algorithm 

It is assumed that a classifier receives samples $\boldsymbol{x}_i$ over time
$i$ arising from some underlying statistical distribution. A loss
function $Q(w, \boldsymbol{x}_i)$ is associated with $\boldsymbol{x}_i$ and depends on an $M \times 1$
parameter vector $w$. The classifier wishes to minimize the risk
function over $w$, which is defined as the expected loss [?, p. 20]:

$$J(w) \triangleq \mathbb{E}_{\boldsymbol{x}_i}\{Q(w, \boldsymbol{x}_i)\} \qquad \text{(risk function)} \tag{1}$$

It is usually assumed that the data $\boldsymbol{x}_i$ are independent and identically
distributed (i.i.d.). It is also assumed that the risk function
$J(w)$ is strongly convex. Obviously, when the data are stationary,
the risk $J(w)$ will not depend on time. Observe that
we are denoting random quantities by using the boldface
notation, which will be our convention in this article. One example
that fits into this formulation is logistic regression [19, p. 117],
where the cost function is defined as:

$$J(w) \triangleq \mathbb{E}_{\boldsymbol{y}_i, \boldsymbol{h}_i}\left\{\frac{\rho}{2}\|w\|^2 + \log\left(1 + e^{-\boldsymbol{y}_i \boldsymbol{h}_i^T w}\right)\right\} \tag{2}$$

where $\boldsymbol{x}_i$ in (1) is now defined as the aggregate data $\{\boldsymbol{y}_i, \boldsymbol{h}_i\}$,
where $\boldsymbol{y}_i$ denotes the scalar label for feature vector $\boldsymbol{h}_i \in \mathbb{R}^M$.
Moreover, $\rho$ is a positive scalar regularization parameter. We
will utilize logistic regression in the simulation section to
illustrate our analysis.
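
As a small illustration of (2), the sketch below evaluates the regularized logistic loss for a single sample and its gradient with respect to $w$. It is only a sketch: the function names `logistic_loss` and `logistic_grad` and the use of NumPy are assumptions made here for illustration, and averaging the loss over samples would only approximate the expectation in (2).

```python
import numpy as np

def logistic_loss(w, h, y, rho):
    """Per-sample loss inside the expectation in (2):
    (rho/2)*||w||^2 + log(1 + exp(-y * h^T w))."""
    margin = y * (h @ w)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow.
    return 0.5 * rho * (w @ w) + np.logaddexp(0.0, -margin)

def logistic_grad(w, h, y, rho):
    """Gradient of the per-sample loss with respect to w."""
    margin = y * (h @ w)
    # 1 / (1 + exp(m)) written in the numerically stable form (1 - tanh(m/2)) / 2.
    weight = 0.5 * (1.0 - np.tanh(0.5 * margin))
    return rho * w - y * h * weight

# Hypothetical usage: one feature vector with label +1 and rho = 0.1.
w = np.zeros(3)
h = np.array([1.0, -2.0, 0.5])
loss_at_zero = logistic_loss(w, h, 1.0, 0.1)
grad_at_zero = logistic_grad(w, h, 1.0, 0.1)
```

Because of the term $\frac{\rho}{2}\|w\|^2$, the Hessian of this loss is bounded from below by $\rho I_M$, which is what makes the risk strongly convex.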




Figure 1: A connected network. The shaded region represents the neighborhood
of node 1.



In order to assess and compare the performance of algorithms
that are used to minimize (1), we adopt the excess-risk
(ER) measure, which is defined as follows:

$$\mathrm{ER}(i) \triangleq \mathbb{E}\{J(\boldsymbol{w}_{i-1}) - J(w^o)\} \qquad \text{(excess-risk)} \tag{3}$$

where $w^o$ is the optimizer of (1) over all $w$ in the feasible space:

$$w^o \triangleq \arg\min_{w} J(w) \tag{4}$$

and $\boldsymbol{w}_{i-1}$ is the estimator for $w^o$ available at time $i-1$. The reason
why the excess-risk is evaluated using $\boldsymbol{w}_{i-1}$ is that excess-risk
measures the generalization ability of a classifier on future
data before observing the data. The estimate $\boldsymbol{w}_i$, as we will see
in Alg. 1, would incorporate data from time $i$. The variable $\boldsymbol{w}_{i-1}$
is generally a random quantity since it will be influenced by
randomness in the data arising from the gradient vector approximations
that are used during the development of stochastic gradient
procedures; the gradient approximations are referred to as
instantaneous approximations in the adaptive literature [20, 21],
and are also sometimes called the gradient oracle in the machine
learning literature (see, e.g., [22]). The expectation in (3) is
taken over the distribution of $\boldsymbol{w}_{i-1}$.

Considerable research has focused on deriving bounds for 
the excess-risk in gradient descent procedures for stand-alone 
classifiers. In this work, we pursue two extensions to these
results. First, we assume that we have a network of learners
connected by means of some topology. The only requirement
is that the network be connected, meaning that there is a path
connecting any two arbitrary agents in the network; this path
may be through a sequence of other agents. Figure 1 illustrates
one such network. The nodes in the shaded region represent the
neighborhood of node 1 (denoted by $\mathcal{N}_1$). Second, we allow the
statistical distribution of the data $\boldsymbol{x}_i$ to change with time. This
change causes the optimizer $w^o$ to drift.

We associate with each agent $k$ in the network an individual
loss function $Q_k(w, \boldsymbol{x}_{k,i})$ evaluated at the corresponding feature
vector, $\boldsymbol{x}_{k,i}$. The corresponding strongly-convex risk function is
generally time-varying and given by:

$$J_{k,i}(w) \triangleq \mathbb{E}_{\boldsymbol{x}_{k,i}}\{Q_k(w, \boldsymbol{x}_{k,i})\} \tag{5}$$

We further consider a global network risk, which is defined as
the average of the individual risks over all nodes:

$$J_i^{\mathrm{glob}}(w) \triangleq \frac{1}{N}\sum_{k=1}^{N} J_{k,i}(w) \qquad \text{(network risk function)} \tag{6}$$

The excess-risk at node $k$ is defined as:

$$\mathrm{ER}_k(i) \triangleq \mathbb{E}\{J_{k,i}(\boldsymbol{w}_{k,i-1}) - J_{k,i}(w_i^o)\} \qquad \text{(excess-risk at node $k$)} \tag{7}$$

while the network excess-risk is defined as:

$$\mathrm{ER}(i) \triangleq \frac{1}{N}\sum_{k=1}^{N} \mathrm{ER}_k(i) = \mathbb{E}\left\{\frac{1}{N}\sum_{k=1}^{N} J_{k,i}(\boldsymbol{w}_{k,i-1}) - J_i^{\mathrm{glob}}(w_i^o)\right\} \qquad \text{(network excess-risk)} \tag{8}$$

where in both cases

$$w_i^o \triangleq \arg\min_{w} J_i^{\mathrm{glob}}(w) \tag{9}$$

When the distribution is stationary and $N = 1$ (i.e., a network
with a single node), we see that our formulation collapses to
the one described by (1) and (3). We assume that the optimizer
$w_i^o$ in (9) is also the optimizer of the component risk functions
$J_{k,i}(w)$, i.e.,

$$\arg\min_{w} J_i^{\mathrm{glob}}(w) = \arg\min_{w} J_{k,i}(w), \qquad k = 1, 2, \ldots, N \tag{10}$$

This condition is satisfied when the nodes are sampling data
arising from a time-varying distribution defined by the same set
of parameters. That is, when the data do not reflect local
preferences, then (10) is usually satisfied. When the environment is
stationary and $w_i^o$ is therefore constant, reference [13] derived
the distributed algorithm listed in the table below for the
solution of (9). One of the objectives of this work is to show that
this same algorithm can also be used to track drifting concepts
$w_i^o$. We will evaluate how well it performs in this case.

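In Alg. 1, each node $k$ interacts with its one-hop neighbors
and updates its parameter estimate using approximations for the
true gradient vector. The coefficients $a_{1,\ell k}$, $c_{\ell k}$, and $a_{2,\ell k}$ are
non-negative scalars corresponding to the $(\ell, k)$ entries of $N \times N$
matrices $A_1$, $C$, and $A_2$, respectively. In view of the
requirement (11), the matrices $A_1$ and $A_2$ are left-stochastic while the
matrix $C$ is right-stochastic. Different choices for $\{A_1, A_2, C\}$
lead to different variations of the algorithm. For example,
setting $A_1 = I$ and $A_2 = A$ leads to an Adapt-then-Combine (ATC)
strategy where the first step is an adaptation step, followed by
combination:

$$\text{(ATC)} \qquad \begin{aligned} \boldsymbol{\psi}_{k,i} &= \boldsymbol{w}_{k,i-1} - \mu \sum_{\ell=1}^{N} c_{\ell k}\, \widehat{\nabla J}_{\ell,i-1}(\boldsymbol{w}_{k,i-1}) \\ \boldsymbol{w}_{k,i} &= \sum_{\ell=1}^{N} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i} \end{aligned} \tag{15}$$

On the other hand, setting $A_1 = A$ and $A_2 = I$ leads to a
Combine-then-Adapt (CTA) strategy where adaptation follows
combination:

$$\text{(CTA)} \qquad \begin{aligned} \boldsymbol{\phi}_{k,i-1} &= \sum_{\ell=1}^{N} a_{\ell k}\, \boldsymbol{w}_{\ell,i-1} \\ \boldsymbol{w}_{k,i} &= \boldsymbol{\phi}_{k,i-1} - \mu \sum_{\ell=1}^{N} c_{\ell k}\, \widehat{\nabla J}_{\ell,i-1}(\boldsymbol{\phi}_{k,i-1}) \end{aligned} \tag{16}$$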

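Algorithm 1: Diffusion strategy for risk optimization

Consider the problem of optimizing the network risk function (6)
in a distributed manner. For each node $k$, let $\mathcal{N}_k$ denote the set of
its neighbors, namely, all nodes with which node $k$ can share
information (including node $k$ itself). Select non-negative coefficients
$\{a_{1,\ell k}\}$, $\{c_{\ell k}\}$, and $\{a_{2,\ell k}\}$ that satisfy

$$\sum_{\ell \in \mathcal{N}_k} a_{1,\ell k} = \sum_{\ell \in \mathcal{N}_k} c_{k\ell} = \sum_{\ell \in \mathcal{N}_k} a_{2,\ell k} = 1, \qquad a_{1,\ell k} = c_{\ell k} = a_{2,\ell k} = 0 \ \text{when } \ell \notin \mathcal{N}_k \tag{11}$$

Each node $k$ starts with an initial weight estimate $w_{k,0}$ and repeats
over $i \geq 1$:

$$\boldsymbol{\phi}_{k,i-1} = \sum_{\ell=1}^{N} a_{1,\ell k}\, \boldsymbol{w}_{\ell,i-1} \tag{12}$$
$$\boldsymbol{\psi}_{k,i} = \boldsymbol{\phi}_{k,i-1} - \mu \sum_{\ell=1}^{N} c_{\ell k}\, \widehat{\nabla J}_{\ell,i-1}(\boldsymbol{\phi}_{k,i-1}) \tag{13}$$
$$\boldsymbol{w}_{k,i} = \sum_{\ell=1}^{N} a_{2,\ell k}\, \boldsymbol{\psi}_{\ell,i} \tag{14}$$

where $\widehat{\nabla J}_{\ell,i-1}(\cdot)$ is an approximation for the true gradient vector
$\nabla J_{\ell,i-1}(\cdot)$, and $\mu$ is a positive step-size parameter.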



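In either the ATC or CTA versions, we can further set $C = I_N$.
In this case, the adaptation step would rely only on the gradient
vector at node $k$, e.g.,

$$\text{(ATC)} \qquad \begin{aligned} \boldsymbol{\psi}_{k,i} &= \boldsymbol{w}_{k,i-1} - \mu\, \widehat{\nabla J}_{k,i-1}(\boldsymbol{w}_{k,i-1}) \\ \boldsymbol{w}_{k,i} &= \sum_{\ell=1}^{N} a_{\ell k}\, \boldsymbol{\psi}_{\ell,i} \end{aligned} \tag{17}$$

$$\text{(CTA)} \qquad \begin{aligned} \boldsymbol{\phi}_{k,i-1} &= \sum_{\ell=1}^{N} a_{\ell k}\, \boldsymbol{w}_{\ell,i-1} \\ \boldsymbol{w}_{k,i} &= \boldsymbol{\phi}_{k,i-1} - \mu\, \widehat{\nabla J}_{k,i-1}(\boldsymbol{\phi}_{k,i-1}) \end{aligned} \tag{18}$$

Likewise, setting $A_1 = A_2 = C = I_N$ leads to the non-cooperative
mode of operation where each node optimizes its risk individually
and independently of the other nodes:

$$\boldsymbol{w}_{k,i} = \boldsymbol{w}_{k,i-1} - \mu\, \widehat{\nabla J}_{k,i-1}(\boldsymbol{w}_{k,i-1}) \qquad \text{(no cooperation)} \tag{19}$$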
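
The recursions (17)-(19) can be sketched in a few lines of NumPy. The code below is an illustrative sketch rather than the authors' implementation: the names `atc_step`, `cta_step`, and `noncoop_step`, and the callable `grad(k, w)` (which stands in for the approximate gradient at node $k$) are hypothetical, and the estimates of all $N$ nodes are assumed to be stored as the rows of an $N \times M$ array.

```python
import numpy as np

def noncoop_step(W, grad, mu):
    """Non-cooperative update (19): every node adapts on its own."""
    return np.stack([W[k] - mu * grad(k, W[k]) for k in range(W.shape[0])])

def atc_step(W, A, grad, mu):
    """Adapt-then-Combine update (17) with C = I.
    W[k] holds w_{k,i-1}; A[l, k] holds a_{lk}, so the columns of A sum to one."""
    Psi = noncoop_step(W, grad, mu)      # adaptation step at every node
    return A.T @ Psi                     # combination: w_k = sum_l a_{lk} psi_l

def cta_step(W, A, grad, mu):
    """Combine-then-Adapt update (18) with C = I."""
    Phi = A.T @ W                        # combination: phi_k = sum_l a_{lk} w_l
    return noncoop_step(Phi, grad, mu)   # adaptation evaluated at phi_k

# Hypothetical usage: three nodes, noise-free gradient of 0.5*||w - w_o||^2.
w_o = np.array([1.0, -2.0])
grad = lambda k, w: w - w_o
A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])          # columns sum to one (left-stochastic)
W = np.arange(6.0).reshape(3, 2)
for _ in range(500):
    W = atc_step(W, A, grad, mu=0.1)
```

Storing the combination weights so that `A[l, k]` equals $a_{\ell k}$ makes the combination step a single product with `A.T`; choosing `A` as the identity reduces both diffusion variants to the non-cooperative update.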

It is important to note that diffusion strategies are different in a
fundamental way from the algorithm presented in [7], which
has the form:

$$\boldsymbol{w}_{k,i} = \sum_{\ell=1}^{N} a_{\ell k}\, \boldsymbol{w}_{\ell,i-1} - \mu\, \widehat{\nabla J}_{k,i-1}(\boldsymbol{w}_{k,i-1}) \tag{20}$$

For instance, comparing with (18), we see that one critical
difference is that the gradient vector used in (20) is evaluated at
$\boldsymbol{w}_{k,i-1}$, whereas it is evaluated at $\boldsymbol{\phi}_{k,i-1}$ in (18). In this way,
information beyond the immediate neighborhood of node $k$
influences the updates at $k$ more effectively in the diffusion case (18).
This order of the computations has an important implication on
the dynamics of the resulting algorithm. For example, it can
be verified that even if all individual learners are stable in the
mean-square sense, a network of learners using an update of the
form (20) can become unstable, while the same network using
the diffusion updates (17)-(18) will always be stable regardless
of the choice of the matrix $A$ (see [12]). In the next section, we
establish a relationship between excess-risk and mean-square-error
(MSE) and provide the main assumptions for the rest of
the manuscript.

3. Excess-risk, Weighted MSE, and Main Assumptions

Introduce the prediction and filtering weighted mean-square-errors
(MSEs):

$$\mathbb{E}\|\tilde{\boldsymbol{w}}^p_{k,i-1}\|_T^2 \triangleq \mathbb{E}\|w_i^o - \boldsymbol{w}_{k,i-1}\|_T^2 \tag{21}$$
$$\mathbb{E}\|\tilde{\boldsymbol{w}}^f_{k,i-1}\|_T^2 \triangleq \mathbb{E}\|w_{i-1}^o - \boldsymbol{w}_{k,i-1}\|_T^2 \tag{22}$$

where $\|x\|_T^2 = x^T T x$ for any positive semi-definite weighting
matrix $T$. When the environment is stationary (i.e., when $w_i^o = w^o$
for all $i$), we notice that there is effectively no difference
in the filtering and prediction MSE in steady-state. The reason
why we need to introduce the two errors is that the excess-risk
(7) requires that a previous estimate (prior to observing the
current data) is used to evaluate the performance of the classifier.
We will see shortly that under the non-stationary model we
adopt in this work, the prediction and filtering MSEs are related
to each other. To proceed, we introduce the following assumption
regarding the Hessian matrices of the functions $J_{k,i}(w)$.

(Assumption 1) The Hessian matrices of the individual risk
functions $J_{k,i}(w)$ are uniformly bounded from below and from
above for all $k \in \{1, \ldots, N\}$ and time $i$:

$$\lambda_{\min} I_M \leq \nabla^2 J_{k,i}(w) \leq \lambda_{\max} I_M \tag{23}$$

where $0 < \lambda_{\min} \leq \lambda_{\max}$. □

Assumption 1 essentially states that the risk functions encountered
for all times and at all nodes can be upper and lower-bounded
by a quadratic cost. The lower-bound on the Hessians
in (23) translates into saying that the functions $J_{k,i}(w)$ are
strongly-convex [?, pp. 9-10]. For example, the risk function
(2) for regularized logistic regression satisfies Assumption 1.

We may note that in [5, 23, 24], the risk functions are assumed
to have bounded gradient vectors (as opposed to bounded
Hessian matrices). Clearly, there are cost functions (such as
quadratic cost functions) where the gradient is not bounded
while the Hessian is; for this reason, our Assumption 1 enables
the subsequent analysis to be applicable to a larger class of risk
functions.

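Now consider the excess-risk suffered at iteration $i$ at node
$k$. It can be expressed as:

$$\begin{aligned}
\mathrm{ER}_k(i) &= \mathbb{E}\{J_{k,i}(\boldsymbol{w}_{k,i-1}) - J_{k,i}(w_i^o)\} \\
&\overset{(a)}{=} -\mathbb{E}\left\{\int_0^1 \nabla J_{k,i}(w_i^o - t\,\tilde{\boldsymbol{w}}^p_{k,i-1})^T dt \; \tilde{\boldsymbol{w}}^p_{k,i-1}\right\} \\
&\overset{(b)}{=} -\mathbb{E}\left\{\int_0^1 \nabla J_{k,i}(w_i^o)^T dt \; \tilde{\boldsymbol{w}}^p_{k,i-1}\right\} + \mathbb{E}\left\{\tilde{\boldsymbol{w}}^{pT}_{k,i-1}\left[\int_0^1 t \int_0^1 \nabla^2 J_{k,i}(w_i^o - s\,t\,\tilde{\boldsymbol{w}}^p_{k,i-1})\, ds\, dt\right] \tilde{\boldsymbol{w}}^p_{k,i-1}\right\} \\
&\overset{(c)}{=} \mathbb{E}\left\{\tilde{\boldsymbol{w}}^{pT}_{k,i-1}\left[\int_0^1 t \int_0^1 \nabla^2 J_{k,i}(w_i^o - s\,t\,\tilde{\boldsymbol{w}}^p_{k,i-1})\, ds\, dt\right] \tilde{\boldsymbol{w}}^p_{k,i-1}\right\} \\
&= \mathbb{E}\|\tilde{\boldsymbol{w}}^p_{k,i-1}\|^2_{\boldsymbol{T}_{k,i}} && \text{(24)} \\
&\overset{(d)}{\leq} \frac{\lambda_{\max}}{2}\, \mathbb{E}\|\tilde{\boldsymbol{w}}^p_{k,i-1}\|^2 && \text{(25)}
\end{aligned}$$

where

$$\tilde{\boldsymbol{w}}^p_{k,i-1} \triangleq w_i^o - \boldsymbol{w}_{k,i-1} \tag{26}$$

Steps (a) and (b) in the sequence of calculations that led to (25)
are a consequence of the following mean-value theorem from
[?, p. 24]:

$$f(a + b) = f(a) + \int_0^1 \nabla f(a + t \cdot b)^T dt \cdot b \tag{27}$$

Step (c) is a consequence of the fact that $w_i^o$ optimizes $J_{k,i}(w)$
so that $\nabla J_{k,i}(w_i^o) = 0$. Step (d) is due to (23), where we defined
the weighting matrix as:

$$\boldsymbol{T}_{k,i} \triangleq \int_0^1 t \int_0^1 \nabla^2 J_{k,i}(w_i^o - s\,t\,\tilde{\boldsymbol{w}}^p_{k,i-1})\, ds\, dt \tag{28}$$

It follows from (25) that if the MSE at all nodes is uniformly
bounded over time, then the network excess-risk (8) will be
bounded by the same bound scaled by $\lambda_{\max}/2$. For this reason, it
is justified that we examine the mean-square-error performance
of the diffusion strategy (12)-(14) under stationary and non-stationary
conditions, and then use these results to bound the
network excess-risk by using the relation: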

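$$\mathrm{ER}(i) = \frac{1}{N}\sum_{k=1}^{N} \mathbb{E}\|\tilde{\boldsymbol{w}}^p_{k,i-1}\|^2_{\boldsymbol{T}_{k,i}} = \mathbb{E}\|\tilde{\boldsymbol{w}}^p_{i-1}\|^2_{\boldsymbol{\mathcal{T}}_i} \tag{29}$$

where

$$\tilde{\boldsymbol{w}}^p_{i-1} \triangleq \mathrm{col}\{\tilde{\boldsymbol{w}}^p_{1,i-1}, \tilde{\boldsymbol{w}}^p_{2,i-1}, \ldots, \tilde{\boldsymbol{w}}^p_{N,i-1}\} \tag{30}$$

collects the prediction weight error vectors across all nodes, and

$$\boldsymbol{\mathcal{T}}_i \triangleq \frac{1}{N}\,\mathrm{diag}\{\boldsymbol{T}_{1,i}, \ldots, \boldsymbol{T}_{N,i}\} \tag{31}$$

A key point to stress here is that the network excess-risk is the
weighted network MSE when the weighting matrix is set to the
above $\boldsymbol{\mathcal{T}}_i$. In order to perform the mean-square-error analysis
of the network, we need to introduce some assumptions. First,
we introduce a modeling assumption regarding the perturbed
gradient vectors used by the algorithm.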

(Assumption 2) We model the perturbed gradient vector as:

$$\widehat{\nabla J}_{k,i}(w) = \nabla J_{k,i}(w) + \boldsymbol{v}_{k,i}(w) \tag{32}$$

where, conditioned on the past history of the estimators $\{\boldsymbol{w}_{k,j}\}$
for $j \leq i-1$ and all $k$, the gradient noise $\boldsymbol{v}_{k,i}(w)$ satisfies:

$$\mathbb{E}\{\boldsymbol{v}_{k,i}(w) \mid \mathcal{H}_{i-1}\} = 0 \tag{33}$$
$$\mathbb{E}\{\|\boldsymbol{v}_{k,i}(w)\|^2\} \leq \alpha \cdot \mathbb{E}\|w_i^o - w\|^2 + \sigma_v^2 \tag{34}$$

for some $\alpha > 0$, $\sigma_v^2 > 0$, and where $\mathcal{H}_{i-1} \triangleq \{\boldsymbol{w}_{k,j} : k = 1, \ldots, N \text{ and } j \leq i-1\}$. □

Assumption 2 models the perturbed gradient vector as the true
gradient plus some noise. This noise consists of two parts: relative
noise and absolute noise. The variance of the relative noise
component depends on the distance between the estimate and
the optimum at time $i$ ($w_i^o$). On the other hand, the variance of
the absolute noise term is represented by the factor $\sigma_v^2$ in (34).
As the quality of the weight estimate by the node improves,
the power of the relative noise component decreases. The second
part of the noise bound in (34) refers to absolute noise;
this component does not depend on the current weight estimate
and is bounded by $\sigma_v^2$. The absolute noise guarantees that there
will always remain some perturbation on the estimated gradient
vectors even when the gradient is evaluated at the optimum $w_i^o$.
We may remark that, in contrast to Assumption 2, most earlier
references [5, 23, 24] in the literature assumed only the presence
of the absolute noise term and ignored relative noise. The
following example is from [13].

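Example 1. Consider ADALINE [?, p. 103]. Let the
binary class label at node $k$ and time $i$ be denoted by $\boldsymbol{y}_{k,i} \in \{-1, +1\}$.
Let the feature vector at node $k$ and time $i$ be denoted
by $\boldsymbol{h}_{k,i} \in \mathbb{R}^M$. ADALINE optimizes the quadratic loss:

$$Q_k(w, \boldsymbol{y}_{k,i}, \boldsymbol{h}_{k,i}) \triangleq |\boldsymbol{y}_{k,i} - \boldsymbol{h}_{k,i}^T w|^2 \tag{35}$$

The risk function is then the expectation of the loss in (35):

$$J_k(w) \triangleq \mathbb{E}\,|\boldsymbol{y}_{k,i} - \boldsymbol{h}_{k,i}^T w|^2 \tag{36}$$

Let the data satisfy the linear model:

$$\boldsymbol{y}_{k,i} = \boldsymbol{h}_{k,i}^T w^o + \boldsymbol{z}_k(i) \tag{37}$$

where the feature vectors $\{\boldsymbol{h}_{k,i}\}$ are assumed to be zero-mean
with a constant covariance matrix $R_{h,k} = \mathbb{E}\{\boldsymbol{h}_{k,i}\boldsymbol{h}_{k,i}^T\}$. The noise
sequence $\{\boldsymbol{z}_k(i)\}$ is assumed to be zero-mean and white with
constant variance $\sigma_{z,k}^2$. The optimal solution $w^o$ that minimizes
(36) satisfies the normal equations:

$$r_{hy,k} = R_{h,k}\, w^o \tag{38}$$

where $r_{hy,k} = \mathbb{E}\{\boldsymbol{h}_{k,i}\boldsymbol{y}_{k,i}\}$. The feature vectors and noise are
assumed to be independent over nodes and time. One instantaneous
approximation for the gradient vector is:

$$\widehat{\nabla J}_k(w) = -2\boldsymbol{h}_{k,i}(\boldsymbol{y}_{k,i} - \boldsymbol{h}_{k,i}^T w) \tag{39}$$

Using (37)-(39) and (32), we have that the gradient noise satisfies:

$$\boldsymbol{v}_{k,i}(w) = \widehat{\nabla J}_k(w) - \nabla J_k(w) = 2(R_{h,k} - \boldsymbol{h}_{k,i}\boldsymbol{h}_{k,i}^T)(w^o - w) - 2\boldsymbol{h}_{k,i}\boldsymbol{z}_k(i) \tag{40}$$

We then have that:

$$\mathbb{E}\{\boldsymbol{v}_{k,i}(w) \mid \mathcal{H}_{i-1}\} = 0 \tag{41}$$
$$\mathbb{E}\|\boldsymbol{v}_{k,i}(w)\|^2 \leq 4\,\mathbb{E}\{\sigma_{\max}^2(R_{h,k} - \boldsymbol{h}_{k,i}\boldsymbol{h}_{k,i}^T)\} \cdot \mathbb{E}\|w^o - w\|^2 + 4\,\mathrm{Tr}(R_{h,k})\,\sigma_{z,k}^2 \tag{42}$$

for all $w \in \mathcal{H}_{i-1}$, where $\sigma_{\max}(A)$ denotes the maximum singular
value of its matrix argument $A$. Therefore, the ADALINE algorithm
satisfies Assumption 2 under (37). Note that both noise
terms (relative and absolute) appear on the right-hand side of
(42). □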
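
The gradient-noise expression (40) can be checked numerically for a single realization of the data. The sketch below is illustrative only: the helper names are hypothetical, the covariance is assumed to be $R_{h,k} = I_M$ for the sample call, and the true gradient $-2R_{h,k}(w^o - w)$ is obtained from (36) together with the normal equations (38).

```python
import numpy as np

def adaline_grad_hat(w, h, y):
    """Instantaneous gradient approximation (39)."""
    return -2.0 * h * (y - h @ w)

def adaline_grad_true(w, w_o, R_h):
    """True gradient of (36): -2 (r_hy - R_h w) = -2 R_h (w_o - w), using (38)."""
    return -2.0 * R_h @ (w_o - w)

def gradient_noise(w, w_o, R_h, h, z):
    """Right-hand side of (40): relative-noise term plus absolute-noise term."""
    return 2.0 * (R_h - np.outer(h, h)) @ (w_o - w) - 2.0 * h * z

# Hypothetical realization: M = 4, unit feature covariance assumed.
rng = np.random.default_rng(0)
M = 4
w_o = rng.standard_normal(M)        # optimizer in the linear model (37)
w = rng.standard_normal(M)          # point at which the gradient is evaluated
R_h = np.eye(M)
h = rng.standard_normal(M)          # one feature vector
z = 0.3                             # one noise sample
y = h @ w_o + z                     # label generated by (37)

difference = adaline_grad_hat(w, h, y) - adaline_grad_true(w, w_o, R_h)
noise = gradient_noise(w, w_o, R_h, h, z)
```

The first term of `gradient_noise` scales with the distance `w_o - w` (relative noise) and vanishes at the optimizer, while the second term persists there (absolute noise), which mirrors the two terms on the right-hand side of (34).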

We shall distinguish between two scenarios in our analysis.
In the first case, we assume the optimizer $w_i^o$ does not change
with time (i.e., we assume stationarity). In the second case, we
assume the optimizer $w_i^o$ varies slowly with time according to a
random walk model.

(Assumption 3) The data process $\boldsymbol{x}_i$ is stationary. This implies
that the risk functions $J_{k,i}(w)$ defined in (5) are time-invariant:

$$J_{k,i}(w) = J_k(w), \quad \text{for all } i \tag{43}$$

In addition, this implies that the optimizer $w_i^o$ of the network risk
function and all individual risk functions is constant, $w_i^o = w^o$
for all $i$. □

When the environment is non-stationary, we shall assume instead
a random-walk model for the minimizer $w_i^o$.

(Assumption 4) In the non-stationary case, the time-varying
optimal vector $\boldsymbol{w}_i^o$ is modeled as a random walk:

$$\boldsymbol{w}_i^o = \boldsymbol{w}_{i-1}^o + \boldsymbol{q}_i \tag{44}$$

where the zero-mean sequence $\boldsymbol{q}_i$ has covariance $\mathbb{E}\{\boldsymbol{q}_i\boldsymbol{q}_i^T\} = Q$
and is independent of the quantities $\{\boldsymbol{v}_k(\boldsymbol{w}_{k,j}), \boldsymbol{q}_j\}$ for all $j < i$.
The mean of $\boldsymbol{w}_i^o$ is set to $\mathbb{E}\{\boldsymbol{w}_i^o\} = w^o$. □

Observe that the time-varying optimizer $\boldsymbol{w}_i^o$ is now denoted
by a boldface letter due to the addition of the random noise
component; furthermore, the expectation in the definition of
excess-risk (8) will now operate over this randomness as well.
In machine learning, this random-walk model was used in [?]
to describe the concept drift of a classifier with a moving
hyperplane. This model is also commonly used to evaluate the
tracking performance of adaptive filters [21, pp. 271-272].

Assumption 4 models the desired set of parameters $\boldsymbol{w}_i^o$ as a
non-stationary first-order autoregressive (AR(1)) process. Such
AR(1) processes are commonly used to model non-stationary
behavior in various contexts such as adaptive filtering [21] and
financial data modeling [26, pp. 142-146], [27, pp. 72-73].
Similar models have been used in other contexts such as web
searching. For example, the original PageRank algorithm, used
by the Google search engine, uses a naive "random surfer" that
models an average user that traverses a random walk over the
graph of Internet webpages [28]. Although the model is simplistic
in terms of modeling the shifts of a user's interest, it has
been demonstrated to achieve excellent page sorting capability.

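Given Assumption 4, we can relate the prediction and filtering
errors introduced in (21)-(22):

$$\begin{aligned}
\mathbb{E}\|\tilde{\boldsymbol{w}}^p_{k,i-1}\|_T^2 &= \mathbb{E}\|\boldsymbol{w}_i^o - \boldsymbol{w}_{k,i-1}\|_T^2 \\
&= \mathbb{E}\|\boldsymbol{w}_{i-1}^o - \boldsymbol{w}_{k,i-1}\|_T^2 + \mathbb{E}\|\boldsymbol{q}_i\|_T^2 \\
&= \mathbb{E}\|\tilde{\boldsymbol{w}}^f_{k,i-1}\|_T^2 + \mathrm{Tr}(QT)
\end{aligned} \tag{45}$$

This means that in order to show that the prediction error $\mathbb{E}\|\tilde{\boldsymbol{w}}^p_{k,i-1}\|^2$
remains bounded for the diffusion algorithm, it is sufficient to
analyze the filtering error $\mathbb{E}\|\tilde{\boldsymbol{w}}^f_{k,i-1}\|^2$ and show that it is bounded.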
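
Relation (45) rests on the random-walk increment being zero-mean and independent of the filtering error, so that the cross term vanishes and $\mathbb{E}\|\boldsymbol{q}_i\|_T^2 = \mathrm{Tr}(QT)$. The sketch below checks this on a finite sample; it is an illustration under stated assumptions, not part of the analysis: the function names are hypothetical, the filtering error is held fixed, and the increments are drawn in symmetric pairs so that their sample mean is exactly zero and `Q` is their sample covariance.

```python
import numpy as np

def weighted_sq_norm(x, T):
    """||x||_T^2 = x^T T x for a positive semi-definite T."""
    return float(x @ T @ x)

def prediction_from_filtering(filtering_mse, Q, T):
    """Right-hand side of (45): filtering MSE plus Tr(Q T)."""
    return filtering_mse + float(np.trace(Q @ T))

rng = np.random.default_rng(1)
M = 3
B = rng.standard_normal((M, M))
T = B @ B.T                               # an arbitrary positive semi-definite weight
w_f = rng.standard_normal(M)              # a fixed filtering error (assumed for the demo)
q_half = rng.standard_normal((50, M))
q = np.vstack([q_half, -q_half])          # symmetric sample: zero mean by construction
Q = q.T @ q / len(q)                      # sample covariance of the increments

# Prediction error is the filtering error shifted by the increment q_i.
prediction_mse = np.mean([weighted_sq_norm(w_f + qi, T) for qi in q])
rhs = prediction_from_filtering(weighted_sq_norm(w_f, T), Q, T)
```

The term $\mathrm{Tr}(QT)$ does not depend on the step-size, which is why a drifting optimizer contributes a component to the prediction error that cannot be removed by shrinking $\mu$.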

We will further introduce an assumption to be used later in
the article to derive relationships between the performance of
the algorithms described by (15)-(19).

(Assumption 5) The risk functions across the nodes are identical:

$$J_{k,i}(w) = J_i(w), \qquad k \in \{1, \ldots, N\} \tag{46}$$

□

Assumption 5 states that all nodes have the same risk function,
but this does not mean that the nodes will receive the same data
realizations. Assumption 5 is satisfied when the nodes utilize
the same loss function $Q(\cdot, \cdot)$ and receive data arising independently
from the same distribution.

4. Stationary Environments

In this section, we focus on obtaining convergence results
for the excess-risk for the distributed diffusion strategy (12)-(14)
under stationary conditions. First, we show that the diffusion
algorithm can achieve arbitrarily small excess-risk given
appropriately chosen step-sizes.

Theorem 1 (Excess-risk for stationary environments is $O(\mu)$).
Let Assumptions 1-3 hold. Given a small constant step-size $\mu$
that satisfies:

$$0 < \mu < \min\left\{\frac{2\lambda_{\max}}{\lambda_{\max}^2 + \alpha},\ \frac{2\lambda_{\min}}{\lambda_{\min}^2 + \alpha}\right\} \tag{47}$$

Then, Algorithm 1 achieves arbitrarily small excess-risk at each
node $k$, i.e.:

$$\limsup_{i \to \infty} \mathrm{ER}_k(i) \leq \epsilon \tag{48}$$

where $\epsilon$ is defined as:

(49)

and is directly proportional to the step-size $\mu$. Since each node
can achieve an arbitrarily small excess-risk, the network excess-risk
in (8) can also be made arbitrarily small.

Proof. Given Assumption 3, we have that the risk functions
$J_{k,i}(w)$ are time-invariant ($J_{k,i}(w) = J_k(w)$) and the optimizer
is constant ($w_i^o = w^o$) for all time $i$. Furthermore, we have
from (25) that the excess-risk at node $k$ is bounded by the scaled
mean-square-error:

$$\mathbb{E}\{J_k(\boldsymbol{w}_{k,i-1}) - J_k(w^o)\} \leq \frac{\lambda_{\max}}{2}\,\mathbb{E}\|\tilde{\boldsymbol{w}}_{k,i-1}\|^2 \tag{50}$$

We now appeal to results from [13] (Theorem 1, Equations (67)
and (72)) where it is shown that for $\mu$ satisfying (47) it holds
that

(51)

Therefore, if we define $\epsilon$ as in (49) and further bound (50) using
(51), we obtain (48). □

Result (50) implies that when the environment is stationary,
meaning the optimizer $w^o$ is actually fixed for all time,
then the excess-risk attained at each node in the network will
be bounded by an arbitrarily small quantity that is proportional
to the step-size $\mu$ when (47) is satisfied. As we will see in the
next section, this arbitrary reduction of the MSE is not generally
possible for non-stationary environments.

In addition to Theorem 1, it is possible to approximate the
excess-risk at node $k$ (and also the network excess-risk) at steady
state for sufficiently small $\mu$.

Theorem 2 (Steady-state approximation for excess-risk).
Let Assumptions 1-3 hold. For small step-sizes that satisfy (47),
the steady-state network excess-risk from (29) for Alg. 1 can be
approximated by:

$$\lim_{i \to \infty} \mathrm{ER}(i) \approx \mu^2\, \mathrm{vec}\left(\mathcal{A}_2^T \mathcal{R}_v \mathcal{A}_2\right)^T (I - \mathcal{F})^{-1}\, \mathrm{vec}(\mathcal{T}) \tag{52}$$

where

$$\mathcal{A}_2 = A_2 \otimes I_M$$
$$\mathcal{D} = \sum_{\ell=1}^{N} \mathrm{diag}\{c_{\ell 1}, \ldots, c_{\ell N}\} \otimes \nabla^2 J_\ell(w^o)$$
$$\mathcal{T} = \frac{1}{2N}\,\mathrm{diag}\{\nabla^2 J_1(w^o), \ldots, \nabla^2 J_N(w^o)\}$$

(53)-(57)

and the symbol $\otimes$ denotes the Kronecker product operation
[29, p. 139] and $\mathrm{vec}(\cdot)$ refers to the operation that stacks the columns
of its matrix argument on top of each other [29, p. 145].
Furthermore, the matrix $\mathcal{R}_v$ in (52) is defined as the covariance
matrix of the vector $\boldsymbol{g}_i$:

$$\boldsymbol{g}_i \triangleq \sum_{\ell=1}^{N} \mathrm{col}\{c_{\ell 1}\boldsymbol{v}_{\ell,i}(w^o), \ldots, c_{\ell N}\boldsymbol{v}_{\ell,i}(w^o)\} \tag{59}$$

That is, $\mathcal{R}_v = \mathbb{E}\{\boldsymbol{g}_i\boldsymbol{g}_i^T\}$.

Proof. From (29), we notice that the excess-risk can be evaluated
as the weighted mean-square-error with weight matrix $\boldsymbol{\mathcal{T}}_i$
defined in (31) and (28). When the environment is stationary
and $w_i^o = w^o$ is constant, the weight matrix $\boldsymbol{T}_{k,i}$ in (28) becomes:

$$\boldsymbol{T}_{k,i} = \int_0^1 t \int_0^1 \nabla^2 J_k(w^o - s\,t\,\tilde{\boldsymbol{w}}_{k,i-1})\, ds\, dt \tag{60}$$

Furthermore, due to Theorem 1, we have that the mean-square
value of $\tilde{\boldsymbol{w}}_{k,i-1}$ is small for small step-size $\mu$ and large $i$. This
implies that we can approximate the weight matrix $\boldsymbol{T}_{k,i}$ by

$$\boldsymbol{T}_{k,i} \approx \int_0^1 t \int_0^1 \nabla^2 J_k(w^o)\, ds\, dt = \frac{1}{2}\nabla^2 J_k(w^o) \triangleq T_k \qquad \text{(small $\mu$)} \tag{61}$$

for large $i$ and small $\mu$. In other words, the matrix $\boldsymbol{T}_{k,i}$ becomes
approximately deterministic and is given by $T_k$ at steady-state.
Therefore, the matrix $\boldsymbol{\mathcal{T}}_i$ defined in (31) can, in steady-state, be
approximated by the deterministic matrix:

$$\mathcal{T} \triangleq \frac{1}{N}\,\mathrm{diag}\{T_1, \ldots, T_N\} \tag{62}$$

We can now utilize results from [13] to approximate the excess-risk
at steady-state. Using (103) from [13] we can write:



(63) 



where $\Sigma$ is an arbitrary positive semi-definite matrix that we are
free to choose. Assume we choose $\Sigma$ such that



2 - = T 



(64) 



for some $T$, which could be equal to (62) or some other choice
(see Table 1). If we stack the columns of $\Sigma$ into a vector $\sigma = \mathrm{vec}(\Sigma)$,
then the above equality implies that $\sigma$ is chosen as

$$\sigma = (I - \mathcal{F})^{-1}\,\mathrm{vec}(T) \tag{65}$$

The matrix $(I - \mathcal{F})$ is invertible for sufficiently small step-sizes
(see App. C in [13]). Therefore, we conclude from (63) that

$$\lim_{i \to \infty} \mathbb{E}\|\tilde{\boldsymbol{w}}_i\|_T^2 \approx \mu^2\, \mathrm{vec}\left(\mathcal{A}_2^T \mathcal{R}_v \mathcal{A}_2\right)^T (I - \mathcal{F})^{-1}\, \mathrm{vec}(T) \tag{66}$$

Different choices for $T$ are possible in (66). For example, if we
select $T$ as in (62), then (66) would approximate the network
excess-risk (8) at steady-state. Table 1 lists other choices for
$T$. □

Different metrics can be evaluated by choosing $T$ appropriately.
For instance, in order to evaluate the mean-square-error
at node $k$, we let $T = E_{kk}$ where $E_{kk}$ is the zero matrix with
a single 1 in the $k$-th diagonal element. On the other hand, in
order to evaluate the excess-risk at node $k$, we let $T = E_{kk} \otimes T_k$
where $T_k = \frac{1}{2}\nabla^2 J_k(w^o)$.
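
The weighting matrices described above can be assembled with Kronecker products. The sketch below is a hedged illustration: the function names are hypothetical, node indices are 0-based, and for the node MSE the selector $E_{kk}$ is expanded to $E_{kk} \otimes I_M$, which is an assumption made here so that the weight conforms with the stacked $NM \times 1$ error vector.

```python
import numpy as np

def selector(k, N):
    """E_kk: N x N all-zero matrix with a single 1 in the k-th diagonal entry."""
    E = np.zeros((N, N))
    E[k, k] = 1.0
    return E

def T_node_mse(k, N, M):
    """Weight that extracts the squared error of node k (assumed E_kk kron I_M)."""
    return np.kron(selector(k, N), np.eye(M))

def T_node_er(k, N, T_k):
    """Weight for the excess-risk at node k: E_kk kron T_k."""
    return np.kron(selector(k, N), T_k)

def T_network_er(T_list):
    """Weight for the network excess-risk: (1/N) diag{T_1, ..., T_N}."""
    N = len(T_list)
    return sum(T_node_er(k, N, T_list[k]) for k in range(N)) / N

def weighted_sq_norm(x, T):
    """||x||_T^2 = x^T T x."""
    return float(x @ T @ x)

# Hypothetical example: N = 3 nodes, M = 2 parameters, made-up Hessian halves T_k.
N, M = 3, 2
rng = np.random.default_rng(0)
x = rng.standard_normal(N * M)                       # stacked error vector col{x_1,...,x_N}
T_list = [(k + 1) * np.diag([1.0, 2.0]) for k in range(N)]
node_mse = weighted_sq_norm(x, T_node_mse(1, N, M))  # squared norm of block 1
network_er = weighted_sq_norm(x, T_network_er(T_list))
```

Building the network weight as the average of the node-level weights reflects that the network excess-risk is the average of the per-node excess-risks.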

It is possible to compare the performance of Alg. 1 against
that of non-cooperative processing (19) when the nodes act
individually and do not cooperate with each other. The
non-cooperative case (19) is a special case of Alg. 1 when the
matrices $\{A_1, A_2, C\}$ are all set equal to the identity matrix.

Theorem 3 (Cooperation versus no-cooperation).
Let Assumptions 1-3 hold. In addition, let Assumption 5 hold
so that all nodes have the same risk function. Assume the step-size
satisfies condition (47). Consider the ATC, CTA, and the
non-cooperative algorithms (17)-(19) with $C = I$. Assume the
combination matrix $A$ in the ATC and CTA cases is chosen to be
doubly-stochastic, meaning that $A^T \mathbb{1} = \mathbb{1}$ and $A \mathbb{1} = \mathbb{1}$. When
Assumption 5 holds, the weighting matrix $\mathcal{T}$ in (62) has the
form $\mathcal{T} = \frac{1}{2N} I_N \otimes \nabla^2 J(w^o)$. Under these conditions, the
steady-state network excess-risk satisfies:

$$\mathrm{ER}_{\mathrm{ATC}} \leq \mathrm{ER}_{\mathrm{CTA}} \leq \mathrm{ER}_{\mathrm{ind}} \tag{67}$$

where $\mathrm{ER}_{\mathrm{ATC}}$ is the steady-state excess-risk when the ATC
algorithm is executed, $\mathrm{ER}_{\mathrm{CTA}}$ is the steady-state excess-risk when
the CTA algorithm is executed, and $\mathrm{ER}_{\mathrm{ind}}$ is the steady-state
excess-risk when the nodes do not cooperate with each other.

Proof. See Appendix A. □

From Theorem 3, we observe that the Adapt-then-Combine
(ATC) algorithm outperforms the Combine-then-Adapt (CTA)
strategy, which in turn outperforms the non-cooperative
strategy for any doubly-stochastic combination matrix $A$. The reason
ATC outperforms CTA is that adaptation precedes combination
in ATC so that improved weight estimates are aggregated
in the combination step. Nevertheless, as the step-size
becomes smaller, the gap between the ATC and CTA
algorithms also becomes smaller (see Figs. 4c and 4d further ahead).

In the next section, we study the performance of the diffusion
strategy (12)-(14) when the optimizer $w_i^o$ is changing
according to Assumption 4. We will establish that the excess-risk
is bounded even under this scenario.

5. Non-Stationary Environments 

In the previous section, we showed that if we use a constant step-size, the mean-square-error and network excess-risk for Alg. 1 can be made arbitrarily small by choosing the step-size to be sufficiently small. However, reduction of the excess-risk is not always possible in non-stationary environments. In
order to arrive at meaningful bounds for the tracking perfor- 
mance of the algorithm, we will utilize the random-walk model 
from Assumption]?] 

Theorem 4 (Asymptotic ER bound for non-stationary data). 
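Let Assumptions 1 and 2 and the random-walk assumption hold, and choose a constant step-size that satisfies:

$$0 < \mu < \frac{2\lambda_{\min} C_*}{\|C\|_1^2\,(\lambda_{\max}^2 + \alpha)} \tag{68}$$

where $\|C\|_1$ represents the maximum absolute column sum of the matrix $C$, while $C_*$ represents the minimum absolute column sum of the matrix $C$. The asymptotic excess-risk at node $k$ then satisfies, as $i \to \infty$:

$$\mathrm{ER}_k(i) \lesssim \underbrace{\frac{\|C\|_1^2\,\sigma_v^2\,\lambda_{\max}}{4\lambda_{\min} C_*}\,\mu}_{\text{Steady-state term}} + \underbrace{\frac{\mathrm{Tr}(Q)\,\lambda_{\max}}{4\lambda_{\min} C_*}\,\mu^{-1}}_{\text{Tracking term}} + \frac{M\lambda_{\max}}{2}\,\mathrm{Tr}(Q) \tag{69}$$

for all $k = 1, \ldots, N$. Since all nodes satisfy this bound, the network excess-risk, $\mathrm{ER}(i)$, is also asymptotically bounded by the right-hand side of (69).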



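Table 1: Choice of $T$ for the evaluation of different performance metrics. $E_{kk}$ indicates the all-zero matrix with a single 1 in the $k$-th diagonal element.

| Metric | $T$ |
|---|---|
| $\mathbb{E}\{J_k(w_{k,\infty}) - J_k(w^o)\}$ | $E_{kk} \otimes T_k$ |
| $\frac{1}{N}\sum_{k=1}^{N} \mathbb{E}\{J_k(w_{k,\infty}) - J_k(w^o)\}$ | $\frac{1}{N}\,\mathrm{diag}\{T_1, \ldots, T_N\}$ |
| $\mathbb{E}\lVert \tilde{w}_{k,\infty} \rVert^2$ | $E_{kk} \otimes I_M$ |
| $\frac{1}{N}\sum_{k=1}^{N} \mathbb{E}\lVert \tilde{w}_{k,\infty} \rVert^2$ | $\frac{1}{N}\, I$ |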

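Proof. To show that the asymptotic excess-risk at node $k$ is bounded, we observe that the excess-risk is asymptotically approximated by the weighted mean-square-error with weight matrix $T_k$ given in (61):

$$\mathrm{ER}_k(i) \approx \mathbb{E}\|w_i^o - w_{k,i-1}\|^2_{T_k} \tag{70}$$

Using (45), we see that the excess-risk can be written in terms of the filtering error:

$$\mathrm{ER}_k(i) \leq \mathbb{E}\|\tilde{w}_{k,i-1}\|^2_{\frac{\lambda_{\max}}{2} I_M} + \mathrm{Tr}(Q T_k) \tag{71}$$

where $\tilde{w}_{k,i} = w_i^o - w_{k,i}$ and the inequality is a result of Assumption 1. We can use Assumption 1 to verify that $\mathrm{Tr}(Q T_k)$ is also bounded since:

$$\begin{aligned}
\mathrm{Tr}(Q T_k) &= \sum_{m=1}^{M}\sum_{n=1}^{M} Q_{mn} T_{k,mn} \\
&\overset{(a)}{\leq} \sqrt{\Big(\sum_{m=1}^{M}\sum_{n=1}^{M} Q_{mn}^2\Big)\Big(\sum_{m=1}^{M}\sum_{n=1}^{M} T_{k,mn}^2\Big)} \\
&= \sqrt{\mathrm{Tr}(Q^2)\,\mathrm{Tr}(T_k^2)} \\
&\overset{(b)}{=} \sqrt{\mathrm{Tr}(U\Omega^2 U^T)\,\mathrm{Tr}(V\Pi^2 V^T)} \\
&= \sqrt{\Big(\sum_{m=1}^{M}\omega_m^2\Big)\Big(\sum_{m=1}^{M}\pi_m^2\Big)} \\
&\overset{(c)}{\leq} \sqrt{\Big(\sum_{m=1}^{M}\omega_m\Big)^2\Big(\sum_{m=1}^{M}\pi_m\Big)^2} \\
&= \sqrt{(\mathrm{Tr}(Q))^2\,(\mathrm{Tr}(T_k))^2} \\
&\overset{(d)}{\leq} \frac{M\lambda_{\max}}{2}\,\mathrm{Tr}(Q)
\end{aligned}$$

where step (a) is due to the Cauchy-Schwarz inequality, step (b) is due to the introduction of the eigenvalue decompositions $Q = U\Omega U^T$ and $T_k = V\Pi V^T$, where $\Omega = \mathrm{diag}\{\omega_1, \ldots, \omega_M\}$ and $\Pi = \mathrm{diag}\{\pi_1, \ldots, \pi_M\}$ contain the non-negative eigenvalues of the symmetric matrices $Q$ and $T_k$, respectively. Step (c) is due to $Q$ and $T_k$ being non-negative definite, and step (d) is due to Assumption 1. This means that the excess-risk at node $k$ (71) can be upper-bounded by

$$\mathrm{ER}_k(i) \leq \frac{\lambda_{\max}}{2}\,\mathbb{E}\|\tilde{w}_{k,i-1}\|^2 + \frac{M\lambda_{\max}}{2}\,\mathrm{Tr}(Q) \tag{72}$$

To bound the filtering error $\mathbb{E}\|\tilde{w}_{k,i}\|^2$, from Appendix B we have the scalar recursion (B.29):

$$\|\mathcal{W}_i\|_\infty \leq \beta^i\,\|\mathcal{W}_0\|_\infty + \left(\mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)\right)\sum_{j=0}^{i-1}\beta^j \tag{73}$$

where $\|x\|_\infty$ denotes the maximum absolute entry of a vector $x$ and

$$\mathcal{W}_i \triangleq \left[\mathbb{E}\|\tilde{w}_{1,i}\|^2, \ldots, \mathbb{E}\|\tilde{w}_{N,i}\|^2\right]^T \tag{74}$$

$$\beta \triangleq 1 - 2\mu\lambda_{\min} C_* + \mu^2(\lambda_{\max}^2 + \alpha)\|C\|_1^2 \tag{75}$$

Notice that when the constant step-size $\mu$ satisfies (68), we have that $\beta < 1$. Therefore, we can evaluate the limit of the geometric series in the second term of (73) as

$$\lim_{i\to\infty}\left(\mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)\right)\sum_{j=0}^{i-1}\beta^j = \frac{\mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)}{1-\beta} \tag{76}$$

Additionally, the limit of the first term on the right-hand side of (73) will be zero since $\beta < 1$. Therefore, we have that

$$\limsup_{i\to\infty}\|\mathcal{W}_i\|_\infty \leq \frac{\mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)}{1-\beta} = \frac{\mu\|C\|_1^2\sigma_v^2}{2\lambda_{\min}C_* - \mu(\lambda_{\max}^2+\alpha)\|C\|_1^2} + \frac{\mathrm{Tr}(Q)}{2\mu\lambda_{\min}C_* - \mu^2(\lambda_{\max}^2+\alpha)\|C\|_1^2} \tag{77}$$

For sufficiently small step-sizes, the denominators of the first and second terms of (77) can be respectively approximated by

$$2\lambda_{\min}C_* - \mu(\lambda_{\max}^2+\alpha)\|C\|_1^2 \approx 2\lambda_{\min}C_* \tag{78}$$

$$2\mu\lambda_{\min}C_* - \mu^2(\lambda_{\max}^2+\alpha)\|C\|_1^2 \approx 2\mu\lambda_{\min}C_* \tag{79}$$

Therefore, we conclude that (77) can be approximated for small step-sizes by

$$\limsup_{i\to\infty}\|\mathcal{W}_i\|_\infty \lesssim \frac{\|C\|_1^2\sigma_v^2}{2\lambda_{\min}C_*}\,\mu + \frac{\mathrm{Tr}(Q)}{2\lambda_{\min}C_*}\,\mu^{-1} \tag{80}$$

Noting the relationship between excess-risk and the mean-square-error in (72), we have that the excess-risk at node $k$ is bounded by

$$\mathrm{ER}_k(i) \lesssim \frac{\|C\|_1^2\sigma_v^2\lambda_{\max}}{4\lambda_{\min}C_*}\,\mu + \frac{\mathrm{Tr}(Q)\lambda_{\max}}{4\lambda_{\min}C_*}\,\mu^{-1} + \frac{M\lambda_{\max}}{2}\,\mathrm{Tr}(Q) \tag{81}$$

and therefore the network excess-risk $\mathrm{ER}(i)$ satisfies this bound as well for sufficiently large $i$ and small $\mu$. □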



Figure 2: Trade-off between tracking performance and steady-state excess-risk, showing the steady-state term, the tracking term, and the resulting upper bound from (69). The scalar $\mu^o$ indicates the optimal choice for the step-size in order to minimize the bound on the excess-risk.



Consider the case where $C = I_N$. We observe from (69) that a trade-off exists between the steady-state performance of the algorithm and its tracking performance. The bound consists of the sum of the steady-state excess-risk (48) derived for stationary environments and a term that depends on $\mu^{-1}$ and which arises as a result of the random-walk model noise $q_i$. To decrease the steady-state error, we would need to use a smaller step-size, which affects the tracking performance adversely. Figure 2 illustrates this trade-off. In the figure, $\mu^o$ indicates the optimal choice for the step-size in order to minimize the bound on the right-hand side of (69). The figure gives insight into the fact that a small step-size will improve the steady-state performance when the environment is stationary, but will harm the tracking ability of the algorithm when the environment is non-stationary. We conclude that the asymptotic network excess-risk remains upper-bounded by a constant, even when the optimizer changes according to a random walk. That is, even as the variance of the random process generating $w^o$ grows indefinitely, the excess-risk at each node remains bounded.
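The trade-off can be made concrete with a short numerical sketch. It assumes only the general shape of the bound described above: a term proportional to $\mu$, a term proportional to $\mu^{-1}$, and a constant. The coefficient names `steady_coeff` and `tracking_coeff` are hypothetical placeholders for the problem-dependent constants, and the resulting minimizer would still have to satisfy the step-size condition of the theorem to be admissible.

```python
import math


def excess_risk_bound(mu, steady_coeff, tracking_coeff, constant=0.0):
    """Bound of the form steady_coeff * mu + tracking_coeff / mu + constant."""
    if mu <= 0:
        raise ValueError("the step-size must be positive")
    return steady_coeff * mu + tracking_coeff / mu + constant


def optimal_step_size(steady_coeff, tracking_coeff):
    """Minimizer of a*mu + b/mu over mu > 0: set the derivative a - b/mu^2 to zero."""
    return math.sqrt(tracking_coeff / steady_coeff)
```

For example, with `steady_coeff = 4.0` and `tracking_coeff = 0.01`, the minimizer is `mu = 0.05`; both smaller and larger step-sizes give a larger value of the bound, which is the behavior sketched in Fig. 2.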

In order to illustrate the application of the result in the context of machine learning, we consider a linear binary classification problem where the task is to find a hyper-plane (through the origin) that best separates features from two classes according to some cost function (such as the logistic regression cost in (2)). Since the hyper-plane is fixed at the origin, the task is to find the best rotation of the hyper-plane to separate the data. Consider now that the distribution from which the feature vectors arise is time-varying and, as a result, the optimal hyper-plane must rotate accordingly (see Fig. 3). Our analysis shows that the diffusion algorithm can track the random-walk rotating hyper-plane proposed in [18] and remain within a constant excess-risk on average for any strongly-convex cost function that satisfies Assumption 1.



6. Simulation Results 

6.1. Stationary Environments 

In this section, we test the distributed diffusion strategy (12)-(14) on three stationary datasets:



• The 'alpha' dataset |13C 

• The 'a9a' dataset flU. 

• The 'webspam' (unigram) dataset jsll]. 

Each set deals with a binary classification problem. The dataset properties are compiled in Table 2. We split the data evenly across the nodes, with the step-size chosen so that it is possible to observe the steady-state behavior. Unfortunately, since some of the datasets are relatively small (once divided over the nodes), the step-size needs to be relatively large. The analysis behind the approximate steady-state expression in Theorem 2 assumes the use of small step-sizes, so we expect a better match between theory and simulation when the data sets are larger and the step-sizes are smaller (see Figs. 4c-4d further ahead) [13, 11]. We perform regularized logistic regression (2) on the dataset in real-time and evaluate the network excess-risk using the ATC, CTA, and non-cooperative algorithms described by (17), (18), and (19), respectively. For the ATC and CTA algorithms, we set the gradient combination matrix $C = I_N$ so that the nodes do not exchange their gradient vectors. In addition, we compare the performance of our algorithm to the centralized full gradient (CFG) algorithm that has access to all data samples from all nodes at every iteration:

$$w_{\mathrm{CFG},i} = w_{\mathrm{CFG},i-1} - \frac{\mu}{N}\sum_{k=1}^{N}\nabla_w J_k(w_{\mathrm{CFG},i-1}) \quad \text{(CFG)} \tag{82}$$

The CFG algorithm averages the gradients from all nodes and moves against the average gradient direction. We also compare against the semi-distributed algorithm from [6], where each node executes stochastic gradient descent up to some time horizon $i$ and then the nodes transmit their estimates $w_{k,i}$ to a central processor that averages all estimates:



$$w_i = \frac{1}{N}\sum_{k=1}^{N} w_{k,i} \quad \text{(time-horizon averaging)} \tag{83}$$



Notice that (83) requires some time horizon $i$ to be known and requires either some central server to average the estimates and redistribute the average (83) back to the nodes or the use of some iterative consensus scheme [32]. In order to compare our algorithm to that of [6], we assume that the averaging occurs at every step of the algorithm (we only evaluate the excess-risk at the central processor, and do not communicate the average back to the nodes since the nodes' iterations do not depend on the averaged estimates). Finally, we also simulate algorithm (20) from [7] using a constant step-size. The same step-size
Figure 3: A rotating hyper-plane in 2D that adjusts to separate data from two classes {+1, -1}. $w^o$ indicates the optimal normal vector of the hyper-plane.



Table 2: Properties of datasets used for performance evaluation and the problem parameters associated with the datasets.

| Dataset | Instances | Attributes ($M$) | $\rho$ | $\mu$ | $N$ | Experiments |
|---|---|---|---|---|---|---|
| alpha | 500000 | 500 | 5 | 0.0001 | 20 | 20 |
| a9a | 32561 | 123 | 5 | 0.02 | 8 | 100 |
| webspam | 350000 | 254 | 5 | 0.0025/0.001 | 40 | 50 |


is used for all algorithms. For the combination matrix A, we 
utilize the Metropolis rule ifioll to generate the coefficients: 



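$$a_{\ell k} = \begin{cases} \dfrac{1}{\max(|\mathcal{N}_k|, |\mathcal{N}_\ell|)}, & \ell \in \mathcal{N}_k \setminus \{k\} \\ 1 - \sum_{j \in \mathcal{N}_k \setminus \{k\}} a_{jk}, & \ell = k \\ 0, & \text{otherwise} \end{cases} \tag{84}$$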
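As an illustration, the following sketch builds Metropolis weights for a small undirected graph and can be used to confirm numerically that the resulting matrix is doubly stochastic. It assumes the standard form of the rule, in which the weight between two neighbors is the reciprocal of the larger of their neighborhood sizes (each node counted as its own neighbor) and the self-weight absorbs the remainder; the function name and the adjacency-matrix input are choices made for this example.

```python
def metropolis_weights(adjacency):
    """Metropolis combination matrix for a symmetric 0/1 adjacency matrix.

    A[l][k] is the weight that node k assigns to node l.
    """
    n = len(adjacency)
    # Neighborhood size |N_k|, counting node k itself (assumed convention).
    size = [1 + sum(1 for l in range(n) if l != k and adjacency[k][l])
            for k in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if l != k and adjacency[k][l]:
                A[l][k] = 1.0 / max(size[k], size[l])
        # The self-weight makes every column sum to one.
        A[k][k] = 1.0 - sum(A[l][k] for l in range(n) if l != k)
    return A
```

Because the off-diagonal weights are symmetric in the two node indices, the rows sum to one as well as the columns, which is the doubly-stochastic property required by Theorem 3.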



The Metropolis weighting matrix $A$ generated using (84) is doubly stochastic. The loss function that each node utilizes is the regularized log-loss:

$$Q(w, h_i, y_i) \triangleq \frac{\rho}{2}\|w\|^2 + \log\left(1 + e^{-y_i h_i^T w}\right) \tag{85}$$



where $h_i$ indicates the feature vector and $y_i$ indicates the true label (±1). In this case, the data $x_{k,i}$ in (5) are defined as $x_{k,i} = \{h_{k,i}, y_{k,i}\}$. The risk function is the expectation of $Q(\cdot)$ over the inputs $h_i$ and $y_i$. In each experiment, a number of nodes are used to distribute the classifier learning task as listed in Table 2.
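For readers who want to experiment, a minimal sketch of the regularized log-loss and its gradient is given below. It assumes the usual form of the loss, a quadratic regularizer $\frac{\rho}{2}\|w\|^2$ plus $\log(1 + e^{-y h^T w})$; the function names are illustrative and the code is not taken from the authors' simulations.

```python
import math


def log_loss(w, h, y, rho):
    """Regularized log-loss: (rho/2)*||w||^2 + log(1 + exp(-y * h^T w))."""
    margin = y * sum(wj * hj for wj, hj in zip(w, h))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    softplus = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return 0.5 * rho * sum(wj * wj for wj in w) + softplus


def log_loss_gradient(w, h, y, rho):
    """Gradient with respect to w: rho*w - y*h / (1 + exp(y * h^T w))."""
    margin = y * sum(wj * hj for wj, hj in zip(w, h))
    # Stable evaluation of 1 / (1 + exp(margin)).
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [rho * wj - y * s * hj for wj, hj in zip(w, h)]
```

The gradient is what a stochastic-gradient learner would evaluate on each incoming sample $(h_{k,i}, y_{k,i})$ during the adaptation step.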
A batch optimization, where all samples from the full dataset are available to the learner, was used in order to compute $w^o$. This optimization was conducted using the LIBLINEAR [33] library. The theoretical curves are computed using the simplified
expressions derived in iHflsIl : 



$$\mathrm{ER}_k(i) \approx \frac{\mu\,\mathrm{Tr}(R_{v,k})}{4N} \tag{86}$$

where $R_{v,k} = \mathbb{E}\{v_{k,i}(w^o)\,v_{k,i}(w^o)^T\}$. Fig. 4 shows the excess-risk learning curves for the different algorithms and different datasets. We observe that the ATC algorithm outperforms the CTA algorithm and the non-cooperative algorithm (as established by Theorem 3), as well as the consensus-type algorithm (20) from [7] when the same constant step-size is used. We also observe from Figs. 4c and 4d that as the step-size decreases, the excess-risk also decreases. This fact is in agreement with our analysis in Theorem 1. We notice that the time-horizon averaging algorithm from [6] is close in performance to the ATC diffusion algorithm. The algorithm from [6], however, requires global communication at every iteration and is not a distributed solution as is the case with diffusion strategies.

In order to evaluate the performance of the actual classifier output by the algorithms, we plot the receiver operating characteristic (ROC) curves in Fig. 5. The classifier for each of the algorithms is computed using:

$$\hat{y}_i = \mathrm{sign}(h_i^T w - b) \tag{87}$$

by sweeping the bias $b$. In Fig. 5, $P_D$ indicates the probability of detection while $P_{FA}$ indicates the probability of false alarm. Notice that the curve for the ATC algorithm is very close to that of the CFG algorithm and the algorithm from [6], even though the ATC algorithm is fully distributed. The CTA algorithm and the consensus algorithm from [7] perform worse than the ATC algorithm. We also see a clear performance improvement over the non-cooperative algorithm. Finally, as the step-size decreases for the 'webspam' dataset, we see that the diffusion algorithm tends to improve in performance and get closer to the centralized batch processing solution. The batch processing curve is computed by using $w^o$ as the separating hyperplane in (87).
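Sweeping the bias to trace out an ROC curve can be sketched as follows. The sketch assumes that the scores $h_i^T w$ have already been computed, that labels take values in {+1, -1}, and that a score exactly equal to the bias is treated as a negative decision; the function name and the list of bias values are illustrative.

```python
def roc_points(scores, labels, biases):
    """Return (P_FA, P_D) pairs for the decision rule sign(score - b)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    points = []
    for b in biases:
        # Probability of detection: positives that are declared positive.
        p_d = sum(1 for s in pos if s - b > 0) / len(pos)
        # Probability of false alarm: negatives that are declared positive.
        p_fa = sum(1 for s in neg if s - b > 0) / len(neg)
        points.append((p_fa, p_d))
    return points
```

A very negative bias declares every sample positive and gives the point (1, 1); a very large bias declares every sample negative and gives (0, 0). Intermediate values trace the curve between these two corners.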
Figure 4: Excess-risk learning curves for different stationary datasets. (a) Excess-risk for 'alpha' dataset. (b) Excess-risk for 'a9a' dataset. (c) Excess-risk for 'webspam' dataset ($\mu = 0.0025$). (d) Excess-risk for 'webspam' dataset ($\mu = 0.001$).

Figure 5: ROC curves for different stationary datasets. (a) ROC curve for 'alpha' dataset. (b) ROC curve for 'a9a' dataset. (c) ROC curve for 'webspam' dataset ($\mu = 0.0025$). (d) ROC curve for 'webspam' dataset ($\mu = 0.001$).
6.2. Non-Stationary Environments

6.2.1. Random Walk Rotating Hyperplane - Gradual Concept Drift

In this section, we simulate a scenario where $w^o$ is a random walk. We do so to illustrate the analysis in Theorem 4 and to simulate the behavior of the algorithms under gradual concept drifts. In the next section, we will simulate instantaneous concept drifts. In order to clarify the presentation of the results, we concentrate in this section on the ATC algorithm and the algorithm from [7] only, since we have already established in the last section that the ATC algorithm outperforms CTA and non-cooperation. We study the algorithm from [7] when the step-size decays with time. This allows us to highlight the importance of utilizing constant step-sizes in non-stationary environments. We generate data for two classes {+1, -1} with Gaussian distributions $\mathcal{N}(m_i, I_2)$ and $\mathcal{N}(-m_i, I_2)$, respectively, where $m_i$ is the mean of the +1 distribution at time $i$. We let $m_i$ be a random walk with increments that are Gaussian with zero mean and covariance $0.01 I_2$. We compute $w^o$ at every iteration based on all the data in the network using the LIBLINEAR library [33]. Each of the $N = 200$ nodes receives one sample per iteration. The Metropolis weights (84) are used to combine the estimates for the ATC algorithm and the algorithm from [7]. An amount of 10% label noise was also added to the dataset. We set the step-size to $\mu = 0.005$ and $\rho = 0.01$ for the loss function in (85). We use the classifier in (87) to obtain the classifier accuracy in Fig. 6a, which is defined as:



$$\text{Accuracy} \triangleq \frac{\text{Number of correctly classified samples}}{\text{Total number of samples}} \tag{88}$$



In addition, we plot the excess-risk in Fig. 6b. We observe that as the target $w_i^o$ changes, the diminishing step-size algorithm from [7] does not cope with non-stationarity. On the other hand, and as predicted by Theorem 4, the constant step-size algorithm can track these changes.
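A data stream of this kind can be generated with a few lines of Python. The sketch below follows the description above (two Gaussian classes with means $\pm m_i$, a Gaussian random walk for $m_i$, label noise, one sample per node per iteration) but adds assumptions of its own: equal class priors, a random walk started at the origin, and the function names themselves.

```python
import random


def random_walk_stream(num_iters, num_nodes, step_var=0.01, label_noise=0.1, seed=0):
    """Yield (mean, batch) pairs; batch holds one (h, y) sample per node."""
    rng = random.Random(seed)
    mean = [0.0, 0.0]  # assumed starting point of the random walk
    for _ in range(num_iters):
        # Random-walk increment with covariance step_var * I_2.
        mean = [m + rng.gauss(0.0, step_var ** 0.5) for m in mean]
        batch = []
        for _ in range(num_nodes):
            y = rng.choice((1, -1))  # assumed equal class priors
            h = [y * m + rng.gauss(0.0, 1.0) for m in mean]
            if rng.random() < label_noise:
                y = -y  # flip the observed label with probability label_noise
            batch.append((h, y))
        yield mean, batch


def accuracy(predictions, labels):
    """Fraction of correctly classified samples."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)
```

Each yielded batch can be fed to the learners, one sample per node, before the mean takes its next random-walk step.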

6.2.2. STAGGER Concepts - Instantaneous Concept Drift 

In addition to the gradual concept drift simulation in the last 
section, we also simulate our algorithm on a dataset with instan- 
taneous concept drift. We use the STAGGER dataset HQ for 
this purpose. We simulate a network with $N = 125$ nodes. All
the nodes experience the concept change simultaneously. As 
in issll . we define the target concept to be changing over 120 
iterations, in intervals of 40 iterations for each target concept: 



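$$\begin{cases} (h_{i,1} = 1) \text{ and } (h_{i,3} = 0), & 1 \leq i \leq 40 \\ (h_{i,1} = 0) \text{ or } (h_{i,2} = 0.5), & 41 \leq i \leq 80 \\ (h_{i,3} = 0.5) \text{ or } (h_{i,3} = 1), & 81 \leq i \leq 120 \end{cases} \tag{89}$$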



The labels are then mapped from {0, +1} to {-1, +1}. The above rule can be seen as a numerical representation of the color, shape, and size attributes through the definitions in Table 3.
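The three target concepts can be written as a small labeling function. The sketch assumes that each attribute is encoded as one of the values 0, 0.5, or 1 (consistent with the values appearing in the rule above), that features are ordered as (color, shape, size), and that labels are returned directly in {-1, +1}; the function name is illustrative.

```python
def stagger_label(h, i):
    """Label a sample h = (color, shape, size) under the concept active at iteration i."""
    color, shape, size = h
    if 1 <= i <= 40:
        positive = (color == 1.0) and (size == 0.0)
    elif 41 <= i <= 80:
        positive = (color == 0.0) or (shape == 0.5)
    elif 81 <= i <= 120:
        positive = (size == 0.5) or (size == 1.0)
    else:
        raise ValueError("iteration index must lie in 1..120")
    return 1 if positive else -1
```

The same feature vector can therefore change label when the iteration index crosses 40 or 80, which is what makes the drift instantaneous.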

An amount of 10% label noise was also added to the dataset at each experiment. The simulation results were averaged over 100 experiments. A regularization factor of $\rho = 0.1$ was used to optimize the log-loss in (85). The batch optimization was carried out using the LIBLINEAR library [33]. A step-size of $\mu = 0.25$ was used to simulate the constant step-size algorithms (ATC, CTA, non-cooperative, and [7]). In addition, we simulate the algorithm from [7] with a diminishing step-size $\mu_i = \mu/\sqrt{i}$ to illustrate the necessity of constant step-sizes for non-stationary environments. Figure 7a shows the excess-risk performance of the different algorithms on the STAGGER concepts. The constant step-size algorithms continuously track the changing target concept, while the diminishing step-size algorithm from [7] fails to do so due to the diminishing learning rate. Observe that the algorithm from [6] would not know when the concept changed and would have to implement a change detector in order to allow the central node to poll the information from all the nodes (or to initiate consensus iterations). We also evaluate the ROC curves using (87) associated with the classifier at the last iteration of the target concept. The ROC curves are illustrated in Fig. 7b. The diminishing step-size algorithm is not helpful in detecting the second concept since it is below the chance line ($P_D = P_{FA}$). In addition, we still notice that the ATC algorithm outperforms the other fully distributed approaches (non-cooperative, CTA, and (20)) and is close to the batch solution. Metropolis weights (84) are used for the combination matrix for the distributed algorithms.

7. General Discussion 

We saw in Sec. 3 that the excess-risk of a classifier can be written as a weighted mean-square-error with a weight matrix chosen according to Table 1 when the step-size $\mu$ is small. This formulation of the excess-risk allows us to study the performance of distributed algorithms and explain their behavior. When the environment is stationary (for example, when the learners are sampling from a fixed distribution), we saw that the ATC and CTA diffusion algorithms can achieve an excess-risk performance proportional to $\mu$. In addition, we established that the ATC algorithm will outperform the CTA algorithm and non-cooperative processing when the combination matrix $A$ is doubly-stochastic. This generalizes previous results that only applied when the loss function used in the learning process is quadratic [11].

When the environment is non-stationary, we modeled the optimizer $w^o$ as a random walk with i.i.d. increments. This model allows us to study the performance of the diffusion algorithm when tracking a non-stationary random process. We obtained (in Theorem 4) a bound on the excess-risk that comprises three terms: a constant term that depends on the covariance matrix of the increments of the random walk process, a term that is proportional to $\mu$, and a term that is inversely proportional to $\mu$. This result is intuitive since we expect the diffusion algorithm to be able to track a fixed optimizer, or a relatively slow optimizer. As the optimizer evolves more quickly, however, the algorithm must increase the step-size in order to become more agile. The trade-off for the tracking ability of the diffusion algorithm is summarized in Fig. 2.

The simulation results illustrated that the steady-state excess-risk performance of the diffusion algorithm is proportional to the step-size $\mu$ (see Figs. 4c-4d). Furthermore, we showed through
Figure 6: Results for Markov random walk simulation. (a) Accuracy for the Markov random walk concept drift across time. Larger values are better. (b) Excess-risk for the Markov random walk concept drift over time. Smaller values are better.



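Table 3: Numerical Representation of STAGGER concepts

| Attribute | Value | Numerical Representation |
|---|---|---|
| Color ($x_{i,1}$) | Green | 0 |
| Color ($x_{i,1}$) | Blue | 0.5 |
| Color ($x_{i,1}$) | Red | 1 |
| Shape ($x_{i,2}$) | Triangle | 0 |
| Shape ($x_{i,2}$) | Circle | 0.5 |
| Shape ($x_{i,2}$) | Rectangle | 1 |
| Size ($x_{i,3}$) | Small | 0 |
| Size ($x_{i,3}$) | Medium | 0.5 |
| Size ($x_{i,3}$) | Large | 1 |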
Figure 7: Results from STAGGER simulation. (b) ROC curves for the three STAGGER concepts.
extensive simulations that the ATC algorithm outperforms the consensus-based algorithm proposed in [7] when constant step-sizes are employed. It can be observed from Fig. 5 that the area under the ROC curve of the ATC algorithm is larger than that of the non-cooperative, consensus-based, and CTA algorithms. Furthermore, the performance of the ATC algorithm is seen to approach that of batch processing, especially for small step-sizes. In Fig. 6b, we see that a constant step-size algorithm can track a changing optimizer, unlike a diminishing step-size algorithm such as the one described in [7].

8. Conclusion 

We analyzed the generalization ability of distributed online learning algorithms by showing that constant step-size algorithms can have bounded network excess-risk in non-stationary environments. We provided closed-form expressions for the asymptotic excess-risk and showed the advantage of cooperation over networks.
Appendix A. Comparing Diffusion and Non-Cooperative Strategies

Appendix A.1. CTA vs. Non-Cooperative Processing

We confine our discussion to the following diffusion models:

$$C = I_N, \quad A_1 = A, \quad A_2 = I_N \quad \text{(CTA)} \tag{A.1}$$

$$C = I_N, \quad A_1 = I_N, \quad A_2 = A \quad \text{(ATC)} \tag{A.2}$$

The case of non-cooperating nodes corresponds to the choices:

$$C = I_N, \quad A_1 = I_N, \quad A_2 = I_N \quad \text{(non-cooperative processing)} \tag{A.3}$$

Our objective is to compare the network excess-risk achieved by the diffusion strategies and the excess-risk achieved when there is no cooperation between the nodes. We will conduct the analysis for constant step-sizes in stationary environments. To begin, we start from (52) and rewrite it as:

E||h>,-i||^ ~ vec(J/)"^(/ - Tr^veciT) (A.4) 



where 



(A.5) 



We now perform the series expansion of (/ - ^F) ^ to get 

CO 

EWwi.if^ ^ vec(J/)^25">vec(7-) 

00 



7=0 



= vec(T')"^ Y^iSj ® &)vec(J/) 



= vec(r)"^vec(S-'J/(S^)"^) 

00 



(A.6) 



When Assumption 5 holds, the weighting matrix $T$ has the form $T = I_N \otimes S$ where $S = \frac{1}{2N}\nabla_w^2 J(w^o)$. We can then simplify the above as:

00 

' (A.7) 



llH-.-illr 



In addition, with Assumption 5, we have $\mathcal{D} = I_N \otimes D^o$ for some $M \times M$ matrix $D^o$ that is the same for all nodes; we can then further write:

S ^ a] ® (Im - fiD") (A.8) 

We define the excess-risk for CTA and non-cooperative pro- 
cessing as: 



CO 

ERcTA = A*' '^''^^^^ ® ^ )4TA'^4k) (A. 10) 

;=o 

where Sqta and Sind are defined as: 

Si„i^lN®(lM-I^D") (A.ll) 

ScTA=A®(/M-yuD") (A. 12) 

Noticing that J/ is the same for CTA and the individual process- 
ing case, we compute the difference in the excess-risk as: 



ER; 



ind 



ER, 



CTA 



^ Tr((Sind(/M ® S )!Bl^ - Scta(/w ® S )SJta) J/) (A. 1 3) 



7=0 



We substitute ( IA.lll i-( IA.12b into dA.BI l. and get: 

ERind - ERcTA - 

CO 

p^ 2 Tr(((/^ - A-'A-''^) ® (Hm - poysiiM - poyw) 

(A. 14) 

Since $S = \frac{1}{2N}\nabla_w^2 J(w^o)$ is positive-definite, we conclude that $(I_M - \mu D^o)^j S (I_M - \mu D^o)^j \geq 0$. Finally, since we assumed that $A$ is doubly-stochastic, then $A^j$ is also doubly-stochastic, as well as $A^j A^{jT}$. Therefore, the matrix $(I - A^j A^{jT}) \geq 0$ and its eigenvalues are in the range [0, 1] [11]. Combining these facts with the knowledge that $\mathcal{Y} \geq 0$, we conclude that:

$$\mathrm{ER}_{\mathrm{CTA}} \leq \mathrm{ER}_{\mathrm{ind}} \quad (\text{small } \mu,\ \text{large } i,\ C = I_N,\ \mathbb{1}^T A = \mathbb{1}^T,\ A\mathbb{1} = \mathbb{1}) \tag{A.15}$$

A similar conclusion holds for ATC. In fact, ATC outperforms CTA as well, as we show next.
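The key matrix fact used above, that $I - AA^T$ is non-negative definite when $A$ is doubly stochastic, can be checked numerically through the identity $x^T(I - AA^T)x = \|x\|^2 - \|A^T x\|^2$. The sketch below is only a sanity check on one example; the function name and the test matrix are choices made for illustration.

```python
def quad_form_identity_minus_AAt(A, x):
    """Return x^T (I - A A^T) x, computed as ||x||^2 - ||A^T x||^2."""
    n = len(A)
    # Entry k of A^T x is sum_l A[l][k] * x[l].
    At_x = [sum(A[l][k] * x[l] for l in range(n)) for k in range(n)]
    return sum(v * v for v in x) - sum(v * v for v in At_x)
```

For a doubly-stochastic matrix the returned value should never be negative (up to rounding), and it is zero for the all-ones vector because $A^T\mathbb{1} = \mathbb{1}$.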
Appendix A.2. ATC vs. CTA

In order to compare ATC to CTA, we continue our assumption that the matrix $A$ is doubly stochastic, but we generalize our model for CTA and ATC from (A.1) and (A.2) to:

$$C, \quad A_1 = A, \quad A_2 = I_N \quad \text{(CTA)} \tag{A.16}$$

$$C, \quad A_1 = I_N, \quad A_2 = A \quad \text{(ATC)} \tag{A.17}$$


where we have modified the model to allow for an arbitrary right-stochastic matrix $C$. We continue from (A.6) and rewrite the network excess-risk at steady-state for both CTA and ATC as:



ERcTA = y,Tr(rT4^^J/cTA(4TA)^) 



ERatc - 



where 



ScTA = [Imn - 
Satc=^^[/mw-;"2)] 

J/CTA=//'9?V 

J/atc = fi^jCK^ 



(A. 18) 
(A. 19) 



(A.20) 
(A.21) 
(A.22) 
(A.23) 



As in the previous section, we assume the same risk function for all nodes (i.e., Assumption 5 holds) so that $\mathcal{D} = I_N \otimes D^o$ and that the weighting matrix $T$ has the form $T = I_N \otimes S$ where $S = \frac{1}{2N}\nabla_w^2 J(w^o)$. With the first assumption, we have:

ScTA = Satc = A^ ® (/m - nD") (A.24) 

We compute the difference between the excess-risks: 

ERcta-ERatc 

CO 

= ^Tr [[A\l-AA')A^'^{lM-tiDPyS (iM-fiOyy^K) 

j=o 

We can verify that the above difference is non-negative by not- 
ing that > and (Im - 1^0°)^ S (I m - uD°y is positive-semi- 
definite. Moreover, AHI - AA^)Aj^ > lUll. Therefore, we 
have established, under our assumptions, that 



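$$\mathrm{ER}_{\mathrm{ATC}} \leq \mathrm{ER}_{\mathrm{CTA}} \tag{A.25}$$

Therefore, combining this result with the result from the previous appendix, we conclude that for small $\mu$, large $i$, $C = I_N$, $\mathbb{1}^T A = \mathbb{1}^T$, and $A\mathbb{1} = \mathbb{1}$:

$$\mathrm{ER}_{\mathrm{ATC}} \leq \mathrm{ER}_{\mathrm{CTA}} \leq \mathrm{ER}_{\mathrm{ind}} \tag{A.26}$$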

Appendix B. Mean-Square-Error Analysis 



We follow the approach of [13] and extend it to handle non-stationary environments as well. We define the error vectors at node $k$ at time $i$ as:



h.i = w- - (f>k,i 



(B.l) 



(B.2) 
(B.3) 



We subtract (fT2l i from ^ and (fT3])-(fT4l) from w° using ( l32l i to 
get 



Ok.i-l 



(B.4) 



i/fkj = hj-i + 9/ + y" ^ ca [V7f,/_i(<^A:,,_i) + vKi^t,,-!)] 

(B.5) 

N 



~f 



(B.6) 



Using the mean- value-theorem for real vectors (l27t . we can ex- 
press the gradient ^Jk,i-]{<Pk,i-i) in terms of 4>k.i-i' 



V7o-i(0-i,/-i)=V7f,,-i«_i)- 

— -Htxi^k,i-\ 
where we are defining 



Jo 



4>kJ-l 

(B.7) 



Ha.i= ^^Ja-i(wU-t^k.i-i)dt (B.8) 
Jo 

Notice that ^Jc,i-iiw°_^) - since the minimizer at time / - 1 
is J. Substituting (IB.7l i into ( IB .51 ). we get 



(B.9) 



Appendix B.l. Local MSE Recursions 

We now derive the mean-square-error (MSE) recursions by 
noting that the squared norm ||x|p = x^x is a convex function 
of X. Therefore, applying Jensen's inequality ['36, p. 77] to (IB. Il l 
and (IB. 3b we get: 



E||^i,,-ill'<X«i.«^lK<-ill'' k^\,...,N (B.IO) 

N 

Mwif <Yj^imWu\\\ k^\,...,N (B.ll) 
From (IB .9b and using Assumption |2] we obtain 
E|||Adl' = IE||^i,i_i|||^_+E||?,-||2+|/2]g 



^C[kVc{^k,i-\) 

t=\ 

(B.12) 

where we are introducing the weighting matrix: 



(B.13) 
The matrices $H_{k,i}$ are positive semi-definite and bounded by:



< iikj < rliM 



where 



yk ~ max 



{=1 



{=1 



(B.14) 



(B.15) 



Now note that the square of from dB.lSI l can be upper-bounded 
by: 



/ iV \ 



7l 



= max I 1 - 2;U/ln,ax ^ Ctt -H yU^/t^ax ^ 



Cfk 



f N \ 



2 ]2 ||^||2 



< 1 -2/z^™„a+/z^/i;;„J|C||t 



(B.16) 



where C* denotes the minimum absolute column sum of the 
matrix C. In order to simplify the notation in the following 
analysis, we introduce the upper-bound 



where 



A' ^(Ai,^+a)\\C\\] (B.18) 



and $\alpha$ is defined in Assumption 2. Also, note that by Lemma 3 from [13], we have:



E 



[=1 



\\C\\l[aEMk.i-i\f + ^l] (B.19) 



Combining ( IB.14I) . ( IB.19I I. and (IB.12I ). we obtain for all A: = 
1,...,A^: 



(B.20) 



Appendix B.2. Network MSE Recursions 

We now combine the MSE values at each node into network 
MSE vectors as follows: 



.7/ l|2lT 



i2nT 



j/i^[m\>/>ij\\\...,m\M^]' 



We can then rewrite (IBTB . dRjO), and (iBJTT l as: 
J/i < ;8^,-i + (/i^llCll^cr^ + Tr(0)l« 



(B.21) 
(B.22) 
(B.23) 

(B.24) 
(B.25) 
(B.26) 



where $x \preceq y$ indicates that each element of the vector $x$ is less than or equal to the corresponding element of the vector $y$. Moreover, the notation $\mathbb{1}$ denotes the vector with all entries equal to one. Using the fact that if $x \preceq y$ then $Bx \preceq By$ for any matrix $B$ with non-negative entries, we can combine the above inequality recursions into a single recursion for $\mathcal{W}_i$ and get:

$$\mathcal{W}_i \preceq \beta A_2^T A_1^T \mathcal{W}_{i-1} + \left(\mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)\right)\mathbb{1} \tag{B.27}$$

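We now upper-bound the $\infty$-norm (maximum absolute value) of the vector $\mathcal{W}_i$ in order to obtain the scalar recursion:

$$\begin{aligned}
\|\mathcal{W}_i\|_\infty &\leq \|\beta A_2^T A_1^T \mathcal{W}_{i-1}\|_\infty + \mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q) \\
&\leq \beta \cdot \|A_2^T\|_\infty \cdot \|A_1^T\|_\infty \cdot \|\mathcal{W}_{i-1}\|_\infty + \mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)
\end{aligned}$$

where $\|A\|_\infty$ denotes the maximum absolute row sum of the matrix $A$. Noting that the matrices $A_1$ and $A_2$ are left-stochastic, we have that $\|A_1^T\|_\infty = 1$ and $\|A_2^T\|_\infty = 1$. Therefore,

$$\|\mathcal{W}_i\|_\infty \leq \beta\,\|\mathcal{W}_{i-1}\|_\infty + \mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q) \tag{B.28}$$

Unrolling (B.28), we get

$$\|\mathcal{W}_i\|_\infty \leq \beta^i\,\|\mathcal{W}_0\|_\infty + \left(\mu^2\|C\|_1^2\sigma_v^2 + \mathrm{Tr}(Q)\right)\sum_{j=0}^{i-1}\beta^j \tag{B.29}$$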
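The unrolling step can be verified numerically by treating the inequality as an equality (the worst case) and comparing the iterated recursion with its closed form. In the sketch below, `beta` plays the role of $\beta$ and `b` stands for the constant driving term; the function names are illustrative and the values used to exercise them are arbitrary.

```python
def iterate_recursion(beta, b, w0, num_iters):
    """Iterate w_i = beta * w_{i-1} + b starting from w_0."""
    w = w0
    for _ in range(num_iters):
        w = beta * w + b
    return w


def unrolled_form(beta, b, w0, num_iters):
    """Closed form: beta^i * w_0 + b * sum_{j=0}^{i-1} beta^j."""
    return beta ** num_iters * w0 + b * sum(beta ** j for j in range(num_iters))


def steady_state_limit(beta, b):
    """Limit of the recursion as i grows, valid for 0 <= beta < 1."""
    return b / (1.0 - beta)
```

When $0 \leq \beta < 1$, the contribution of the initial condition decays geometrically and the iterates approach `b / (1 - beta)`, which is the quantity that appears in the asymptotic bound.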